Language-independent pre-processing of large document bases for text classification
Text classification is a well-known topic in the research of knowledge discovery in
databases. Algorithms for text classification generally involve two stages. The first
is concerned with identification of textual features (i.e. words and/or phrases) that
may be relevant to the classification process. The second is concerned with
classification rule mining and categorisation of "unseen" textual data. The first
stage is the subject of this thesis. It often involves an analysis of text that is
language-specific (and possibly domain-specific) and that may also be
computationally costly, especially when dealing with large datasets. Existing
approaches to this stage are not, therefore, generally applicable to all languages. In
this thesis, we examine a number of alternative keyword selection methods and
phrase generation strategies, coupled with two potential significant word list
construction mechanisms and two final significant word selection mechanisms, to
identify such words and/or phrases in a given textual dataset that are expected to
serve to distinguish between classes, by simple, language-independent statistical
properties. We present experimental results, using common (large) textual datasets
presented in two distinct languages, to show that the proposed approaches can
produce good performance with respect to both classification accuracy and
processing efficiency. In other words, the study presented in this thesis
demonstrates the possibility of efficiently solving the traditional text classification
problem in a language-independent (and domain-independent) manner.
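
To make the idea of selecting significant words by simple, language-independent statistics more concrete, the following is a minimal sketch only. It is not one of the thesis's own selection methods: the function names, the "share of documents in one class" score, and the thresholds `min_share` and `min_docs` are all assumptions chosen for illustration. It also assumes text whose words are delimited by whitespace or punctuation, and uses no stop-word list, stemmer, or other language-specific resource.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Split on non-word characters only; no stop-word list or stemming.
    return [t for t in re.split(r"\W+", text.lower()) if t]


def significant_words(docs, labels, min_share=0.8, min_docs=2):
    """Return {word: class} for words concentrated in a single class.

    A word is kept when it occurs in at least `min_docs` documents and at
    least `min_share` of those documents belong to one class.
    (Both thresholds are illustrative assumptions.)
    """
    doc_freq = defaultdict(Counter)  # word -> {class: number of documents}
    for text, label in zip(docs, labels):
        for word in set(tokenize(text)):
            doc_freq[word][label] += 1

    selected = {}
    for word, per_class in doc_freq.items():
        total = sum(per_class.values())
        best_class, best_count = per_class.most_common(1)[0]
        if total >= min_docs and best_count / total >= min_share:
            selected[word] = best_class
    return selected


# Toy corpus, invented for this example.
docs = [
    "the match ended with a late goal",
    "a goal in the final minute won the match",
    "the bank raised a key interest rate",
    "interest rates fell as the bank cut costs",
]
labels = ["sport", "sport", "finance", "finance"]
result = significant_words(docs, labels)
```

On this toy corpus, frequent words such as "the" are spread evenly across both classes and are therefore rejected without any stop-word list, while "match", "goal", "bank" and "interest" are kept as class indicators.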
EThOS - Electronic Theses Online Service, GB, United Kingdo